Hi, I am Garima. Last period I was a student on the Quantitative Skills course, and that is where I heard about this course. As I understand it, this course will help me explore data more freely and deepen my knowledge of data wrangling in a more professional and precise manner. As this is an advanced-level course, I believe it will also help me with my master's thesis, especially when working out my methodology.

My favorite chapters from the RHDS book are chapters 3, 4, and 5, as they are the ones that truly helped me with my basics. I especially enjoy using the mutate() function and visualizing different plots.

My Git repository link: https://github.com/garimahelsinki/IODS-project.git
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Fri Nov 25 15:28:41 2022"
The text continues here.
Describe the work you have done this week and summarize your learning.
date()
## [1] "Fri Nov 25 15:28:41 2022"
Here we go again…
setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")
library(readr)
Data2 <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/learning2014.csv",show_col_types = FALSE)
Data2
## # A tibble: 166 × 7
## gender age attitude deep stra surf points
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 53 3.7 3.58 3.38 2.58 25
## 2 M 55 3.1 2.92 2.75 3.17 12
## 3 F 49 2.5 3.5 3.62 2.25 24
## 4 M 53 3.5 3.5 3.12 2.25 10
## 5 M 49 3.7 3.67 3.62 2.83 22
## 6 F 38 3.8 4.75 3.62 2.42 21
## 7 M 50 3.5 3.83 2.25 1.92 21
## 8 F 37 2.9 3.25 4 2.83 31
## 9 M 37 3.8 4.33 4.25 2.17 24
## 10 F 42 2.1 4 3.5 3 26
## # … with 156 more rows
Observation: The data contains 166 rows and 7 columns: gender, age, attitude, deep, stra, surf, and points.
# Access the ggplot2 library
library(ggplot2)
# initialize plot with data and aesthetic mapping
p1 <- ggplot(Data2, aes(x = attitude, y = points, col = gender))
# define the visualization type (points)
p2 <- p1 + geom_point()
# draw the plot
p2
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
p3
## `geom_smooth()` using formula 'y ~ x'
# add a main title
p4 <- p3 + ggtitle("Student's attitude versus exam points")
# draw the plot!
p4
## `geom_smooth()` using formula 'y ~ x'
pairs(Data2[-1])
# access the GGally and ggplot2 libraries
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
# create a more advanced plot matrix with ggpairs()
p <- ggpairs(Data2, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
#draw the plot
p
Interpretation of the correlation plot above: In the plot, the variable names are displayed on the outer edges of the matrix. The boxes along the diagonal display the density plot of each variable, the boxes in the lower-left corner display the scatterplot for each pair of variables, and the boxes in the upper-right corner display the Pearson correlation coefficient for each pair of variables.
The Pearson correlation measures the linear relationship between two variables. It takes values between -1 and 1, where -1 signifies a total negative linear correlation, 0 signifies no linear correlation, and +1 signifies a total positive linear correlation.
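These endpoint cases can be checked quickly with R's cor() function on small synthetic vectors (toy data, not from the course set):

```r
x <- 1:10

# An increasing linear function of x: correlation exactly 1
cor(x, 2 * x)

# A decreasing linear function of x: correlation exactly -1
cor(x, -3 * x + 5)

# A symmetric (quadratic) function of x: no linear association, correlation 0
cor(x, (x - mean(x))^2)
```

The last case also illustrates that a Pearson correlation of 0 only rules out a *linear* relationship, not any relationship at all.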
We can see that deep has essentially no correlation with points or age. surf, in turn, shows no strong linear relationship with any of the other variables: its correlations are at most mildly negative, so linearity is largely missing.
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = Data2) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# fit a linear model
my_model <- lm(points ~ attitude, data = Data2)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = Data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Interpretation: The very small p-value (4.12e-09) suggests that there is a statistically significant relationship between points and attitude.
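The p-value can also be extracted programmatically from the coefficient table. A minimal sketch on simulated data (the variable names mimic the course data, but the numbers here are synthetic):

```r
set.seed(1)
attitude <- runif(100, min = 1, max = 5)
points <- 11 + 3.5 * attitude + rnorm(100, sd = 5)  # true slope 3.5, noisy

fit <- lm(points ~ attitude)
coefs <- summary(fit)$coefficients

# p-value of the slope, compared against the usual 0.05 threshold
coefs["attitude", "Pr(>|t|)"] < 0.05
```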
library(GGally)
library(ggplot2)
# create a plot matrix with ggpairs()
ggpairs(Data2, lower = list(combo = wrap("facethist", bins = 20)))
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)
# print out a summary of the model
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = Data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
#adding third variable as surf
my_model3 <- lm(points ~ attitude + stra + surf, data = Data2)
#printing out a summary of the model
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = Data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Interpretation of the multiple regression: attitude remains a statistically significant predictor; a one-unit increase in attitude is associated with an increase of about 3.4 exam points. In contrast, stra and surf are not statistically significant predictors of points at the 0.05 level.
## Graphical model validation
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = Data2)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
plot(my_model2, c(1,2,5))
Interpretation of the diagnostic plots

1. Residuals vs Fitted: Linearity seems to hold reasonably well, as the red line stays close to the dashed horizontal line. We can also note some heteroskedasticity: moving right along the x-axis, the spread of the residuals appears to increase. Finally, observations 145, 56, and 35 have large residuals and may be outliers.

2. Normal Q-Q plot: The quantiles of the standard normal distribution are plotted on the x-axis, and the corresponding quantiles of the standardized residuals on the y-axis. Most of the points fall close to the 45-degree reference line, but at both tails there are some discrepancies, especially for observations 145, 35, and 56.

3. Residuals vs Leverage: This plot shows how the spread of the standardized residuals changes as the leverage increases, i.e. the sensitivity of the fitted value ŷ_i to a change in y_i. Points with high leverage may be influential. To assess this we can look at Cook's distance, which measures the effect of deleting a point on the fitted parameter vector; it is drawn as a dotted red line, and points outside it have high influence. In this case there are no such influential points.
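The Cook's distances behind the Residuals vs Leverage plot can also be computed directly with cooks.distance(). A small sketch on simulated data with one deliberately influential point:

```r
set.seed(42)
x <- 1:30
y <- 2 * x + rnorm(30)
y[30] <- y[30] + 40          # plant a large outlier at a high-leverage x value

fit <- lm(y ~ x)
cd <- cooks.distance(fit)

which.max(cd)                # the planted observation dominates Cook's distance
```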
# Create model object m
m <- lm(points ~ attitude, data = Data2)
# print out a summary of the model
summary(m)
##
## Call:
## lm(formula = points ~ attitude, data = Data2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# New observations
new_attitudes <- c("Mia" = 3.8, "Mike"= 4.4, "Riikka" = 2.2, "Pekka" = 2.9)
new_data <- data.frame(attitude = new_attitudes)
# Print out the new data
summary(new_data)
## attitude
## Min. :2.200
## 1st Qu.:2.725
## Median :3.350
## Mean :3.325
## 3rd Qu.:3.950
## Max. :4.400
# Predict the new students exam points based on attitude
predict(m, newdata = new_data)
## Mia Mike Riikka Pekka
## 25.03390 27.14918 19.39317 21.86099
Interpretation: Using the fitted model, predict() estimates the exam points of the new students from their attitude scores. Consistent with the model summary, each additional unit of attitude corresponds to roughly 3.5 more predicted exam points.
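As a sanity check, a prediction from a simple linear model is just intercept + slope × attitude. Using the coefficient estimates printed in the summary above:

```r
# Coefficient estimates from the model summary above
intercept <- 11.6372
slope <- 3.5255

# Manual prediction for Mia (attitude = 3.8) -- matches predict() up to rounding
intercept + slope * 3.8
```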
End of chapter
date()
## [1] "Fri Nov 25 15:29:01 2022"
About the Data set
The data was collected for the following paper: P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.
These data concern the secondary-school achievement of students at two Portuguese schools. The variables were gathered from school reports and questionnaires and include student grades as well as demographic, social, and school-related features. Two datasets are supplied, covering performance in two different courses: Mathematics (mat) and Portuguese language (por). The two datasets were modeled in [Cortez and Silva, 2008] under binary/five-level classification and regression tasks. Note that the target attribute G3 correlates strongly with G1 and G2: G3 is the final-year grade (issued in the third period), while G1 and G2 are the grades for the first and second periods, respectively.
Attributes for both student-mat.csv (Math course) and student-por.csv (Portuguese language course) datasets:
1 school - student’s school (binary: ‘GP’ - Gabriel Pereira or ‘MS’ - Mousinho da Silveira )
2 sex - student’s sex (binary: ‘F’ - female or ‘M’ - male)
3 age - student’s age (numeric: from 15 to 22)
4 address - student’s home address type (binary: ‘U’ - urban or ‘R’ - rural)
5 famsize - family size (binary: ‘LE3’ - less or equal to 3 or ‘GT3’ - greater than 3)
6 Pstatus - parent’s cohabitation status (binary: ‘T’ - living together or ‘A’ - apart)
7 Medu - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
8 Fedu - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
9 Mjob - mother’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
10 Fjob - father’s job (nominal: ‘teacher’, ‘health’ care related, civil ‘services’ (e.g. administrative or police), ‘at_home’ or ‘other’)
11 reason - reason to choose this school (nominal: close to ‘home’, school ‘reputation’, ‘course’ preference or ‘other’)
12 guardian - student’s guardian (nominal: ‘mother’, ‘father’ or ‘other’)
13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15 failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16 schoolsup - extra educational support (binary: yes or no)
17 famsup - family educational support (binary: yes or no)
18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
19 activities - extra-curricular activities (binary: yes or no)
20 nursery - attended nursery school (binary: yes or no)
21 higher - wants to take higher education (binary: yes or no)
22 internet - Internet access at home (binary: yes or no)
23 romantic - with a romantic relationship (binary: yes or no)
24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25 freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29 health - current health status (numeric: from 1 - very bad to 5 - very good)
30 absences - number of school absences (numeric: from 0 to 93)
These grades are related with the course subject, Math or Portuguese:
31 G1 - first period grade (numeric: from 0 to 20)
32 G2 - second period grade (numeric: from 0 to 20)
33 G3 - final grade (numeric: from 0 to 20, output target)
setwd("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data")
library(readr)
alc <- read_csv("C:/Users/Aadhar/Desktop/Uni/Open Data Science/data/alc.csv",show_col_types = FALSE)
alc
## # A tibble: 370 × 41
## school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason
## <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 GP F 18 U GT3 A 4 4 at_home teach… course
## 2 GP F 17 U GT3 T 1 1 at_home other course
## 3 GP F 15 U LE3 T 1 1 at_home other other
## 4 GP F 15 U GT3 T 4 2 health servi… home
## 5 GP F 16 U GT3 T 3 3 other other home
## 6 GP M 16 U LE3 T 4 3 services other reput…
## 7 GP M 16 U LE3 T 2 2 other other home
## 8 GP F 17 U GT3 A 4 4 other teach… home
## 9 GP M 15 U LE3 A 3 2 services other home
## 10 GP M 15 U GT3 T 3 4 other other home
## # … with 360 more rows, and 30 more variables: guardian <chr>,
## # traveltime <dbl>, studytime <dbl>, schoolsup <chr>, famsup <chr>,
## # activities <chr>, nursery <chr>, higher <chr>, internet <chr>,
## # romantic <chr>, famrel <dbl>, freetime <dbl>, goout <dbl>, Dalc <dbl>,
## # Walc <dbl>, health <dbl>, failures <dbl>, paid <chr>, absences <dbl>,
## # G1 <dbl>, G2 <dbl>, G3 <dbl>, alc_use <dbl>, high_use <lgl>,
## # probability <dbl>, prediction <lgl>, prob2 <dbl>, prob3 <lgl>, …
Observation: The data set, named here as "alc", has 370 observations of 41 variables (the original joined variables plus derived columns such as alc_use and high_use).
# access the tidyverse libraries tidyr, dplyr, ggplot2
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# glimpse at the alc data
glimpse(alc)
## Rows: 370
## Columns: 41
## $ school <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G…
## $ sex <chr> "F", "F", "F", "F", "F", "M", "M", "F", "M", "M", "F", "…
## $ age <dbl> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15, 15, …
## $ address <chr> "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "U", "…
## $ famsize <chr> "GT3", "GT3", "LE3", "GT3", "GT3", "LE3", "LE3", "GT3", …
## $ Pstatus <chr> "A", "T", "T", "T", "T", "T", "T", "A", "A", "T", "T", "…
## $ Medu <dbl> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3, 3,…
## $ Fedu <dbl> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, 3, 2,…
## $ Mjob <chr> "at_home", "at_home", "at_home", "health", "other", "ser…
## $ Fjob <chr> "teacher", "other", "other", "services", "other", "other…
## $ reason <chr> "course", "course", "other", "home", "home", "reputation…
## $ guardian <chr> "mother", "father", "mother", "mother", "father", "mothe…
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, 3, 1,…
## $ studytime <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, 2, 1,…
## $ schoolsup <chr> "yes", "no", "yes", "no", "no", "no", "no", "yes", "no",…
## $ famsup <chr> "no", "yes", "no", "yes", "yes", "yes", "no", "yes", "ye…
## $ activities <chr> "no", "no", "no", "yes", "no", "yes", "no", "no", "no", …
## $ nursery <chr> "yes", "no", "yes", "yes", "yes", "yes", "yes", "yes", "…
## $ higher <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes", "yes", …
## $ internet <chr> "no", "yes", "yes", "yes", "no", "yes", "yes", "no", "ye…
## $ romantic <chr> "no", "no", "no", "yes", "no", "no", "no", "no", "no", "…
## $ famrel <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, 5, 5,…
## $ freetime <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3, 5,…
## $ goout <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2, 5,…
## $ Dalc <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2,…
## $ Walc <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, 1, 4,…
## $ health <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, 4, 5,…
## $ failures <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3,…
## $ paid <chr> "no", "no", "yes", "yes", "yes", "yes", "no", "no", "yes…
## $ absences <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3, 9,…
## $ G1 <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11, 14, …
## $ G2 <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11, 15,…
## $ G3 <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12, 16…
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1.5, 1…
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE, F…
## $ probability <dbl> 0.2699557, 0.2894543, 0.2291836, 0.3126444, 0.2858545, 0…
## $ prediction <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ prob2 <dbl> 0.3555814, 0.2953958, 0.2997191, 0.3040783, 0.2822524, 0…
## $ prob3 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
## $ probability_2 <dbl> 0.1849034, 0.1533271, 0.1849034, 0.2569653, 0.2001637, 0…
## $ prediction_2 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, …
# use gather() to gather columns into key-value pairs and then glimpse() at the resulting data
gather(alc) %>% glimpse
## Rows: 15,170
## Columns: 2
## $ key <chr> "school", "school", "school", "school", "school", "school", "sch…
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP"…
# it may help to take a closer look by View() and browse the data
gather(alc) %>% View
# draw a bar plot of each variable
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
Above is a bar plot of each variable in the data set, showing how each variable is distributed.
I am choosing the following variables:

schoolsup - extra educational support (binary: yes or no)
higher - wants to take higher education (binary: yes or no)
famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
G3 - final grade (numeric: from 0 to 20, output target)
The variable ‘schoolsup’ will help me understand how high alcohol use relates to how much study support a student needs. My hypothesis is that with increased alcohol use, more support will be needed.
With the help of the variable ‘higher’, I want to study whether the willingness to pursue higher education differs between those who consume more alcohol and the rest. My hypothesis is that increased alcohol consumption will have an effect on the desire to pursue higher education, but not a significant one.
The ‘famrel’ variable will help me understand the effect of high alcohol use on family relationships. My hypothesis is that high consumption strains the relationship with family.
The final grade (‘G3’) variable will help me see whether the final grades of high-use students differ from those of low-use students. My hypothesis is that grades are negatively affected by high alcohol consumption.
# Exploring the chosen variables through visualization
# 1. Exploring high_use with schoolsup, i.e., extra support needed from school
# access the tidyverse libraries dplyr and ggplot2
library(dplyr); library(ggplot2)
g1 <- ggplot(alc, aes(x = high_use))
g1 + geom_bar() + facet_wrap("schoolsup") + ggtitle("Effect of alcohol consumption on Educational support required") + xlab("High alcohol usage") + aes(fill = high_use)
Observation: From the visualization above we can see that comparatively few students both use alcohol heavily and require extra educational support. Interestingly, students whose alcohol use is not high seek more support from the school than those with high use. My hypothesis therefore turns out to be false, which I find very interesting.
# 2. Exploring high_use with higher, i.e., the desire for higher education
g2 <- ggplot(alc, aes(x = high_use))
g2 + geom_bar() + facet_wrap("higher") + ggtitle("Desire to take higher education vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)
Observation: This variable supports my hypothesis: no matter how much alcohol is consumed, the desire for higher education does not go away. Let's take a closer look below.
# Summarising to see the results more clearly
alc %>% group_by(higher, high_use) %>% summarise(count = n())
## `summarise()` has grouped output by 'higher'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups: higher [2]
## higher high_use count
## <chr> <lgl> <int>
## 1 no FALSE 7
## 2 no TRUE 9
## 3 yes FALSE 252
## 4 yes TRUE 102
Observation: Only 16 students do not wish to pursue higher education, of whom 9 are high users. This suggests there is no significant relationship between alcohol use and the desire to study further, which seems logical to me: consumption habits can be temporary, whereas the choice to continue studying is a thought-through decision for most people.
# 3. Exploring high_use and its effect on famrel, i.e., family relationships
g3 <- ggplot(alc, aes(x = high_use))
g3 + geom_bar() + facet_wrap("famrel") + ggtitle("family relationships (1-5) vs high alcohol usage") + xlab("High alcohol usage") + aes(fill = high_use)
Observation: Family relationships for most students are on the positive side, with no marked difference between high and low alcohol use. In fact, in the bad-relationship categories the count of low-use students is higher, but that is expected, since there are more low-use students than high-use students overall. Therefore, my hypothesis of a negative effect turns out to be false.
# 4. Exploring high_use and its effect on G3, i.e., the final grades of students
g4 <- ggplot(alc, aes(x = high_use, y = G3))
g4 + geom_boxplot() + ggtitle("Effect of high alcohol usage on Final grades") + xlab("High alcohol usage") + ylab("Final grade")+ aes(col = high_use)
Observation: We can clearly see that students with high consumption have somewhat lower grades than those with low consumption. The hypothesis for this variable holds: high alcohol consumption is associated with negatively affected grades.
About the logistic regression model

Logistic regression is used to predict the class (or category) of an observation based on one or more predictor variables (x). It models a binary outcome, that is, a variable that can take only two possible values.

We will now use a logistic regression model to identify factors related to higher-than-average student alcohol consumption.
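Before fitting it to the alc data, here is a minimal, self-contained sketch of such a model on simulated data (synthetic values, not the course data), including the conversion of a coefficient to an odds ratio:

```r
set.seed(7)
n <- 500
x <- rnorm(n)
p <- plogis(-1 + 1.2 * x)          # true model: log-odds = -1 + 1.2 * x
y <- rbinom(n, size = 1, prob = p) # binary outcome drawn from those probabilities

fit <- glm(y ~ x, family = "binomial")

coef(fit)                          # estimates should land near -1 and 1.2
exp(coef(fit)["x"])                # odds ratio per one-unit increase in x
```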
# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
# print out a summary of the model
summary(m)
##
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4957 -0.8206 -0.7292 1.3560 1.9346
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.83332 0.74905 2.448 0.0144 *
## schoolsupyes -0.37002 0.35929 -1.030 0.3031
## higheryes -0.78379 0.55077 -1.423 0.1547
## famrel -0.27320 0.12418 -2.200 0.0278 *
## G3 -0.07269 0.03692 -1.969 0.0490 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 437.36 on 365 degrees of freedom
## AIC: 447.36
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept) schoolsupyes higheryes famrel G3
## 1.83331647 -0.37001600 -0.78378621 -0.27320431 -0.07269305
Observation: The p-values show that famrel (p ≈ 0.028) and G3 (p ≈ 0.049) are statistically significant at the 0.05 level, while schoolsup (p ≈ 0.30) and higher (p ≈ 0.15) are not. All four coefficients are negative: better family relationships and higher final grades are associated with significantly lower odds of high alcohol use, and extra school support and the desire for higher education point in the same direction without reaching significance.

There were 4 Fisher scoring iterations before the model-fitting process converged and the results were output.

Note that although the intercept is positive, the coefficients of all the explanatory variables are less than 0, which matches the odds ratios below 1 computed in the next step.
# find the model with glm()
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
# compute odds ratios (OR)
OR <- coef(m) %>% exp
# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 6.2545955 1.4672230 28.1361910
## schoolsupyes 0.6907233 0.3292951 1.3616633
## higheryes 0.4566737 0.1504271 1.3477672
## famrel 0.7609373 0.5955510 0.9708981
## G3 0.9298862 0.8644588 0.9995539
Observation: The odds ratios suggest that students receiving extra educational support have about 31% lower odds of high alcohol use (OR ≈ 0.69), and those who want to pursue higher education have about 54% lower odds (OR ≈ 0.46), although neither effect is statistically significant (their confidence intervals include 1). Each one-unit increase in famrel lowers the odds of high use by about 24% (OR ≈ 0.76), and each additional final-grade point lowers them by about 7% (OR ≈ 0.93).
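The percentage interpretations come directly from the odds ratios: for an OR below 1, the drop in the odds is (1 − OR) × 100 percent. For example, for famrel from the table above:

```r
or_famrel <- 0.7609373          # odds ratio of famrel from the table above

# Percentage fall in the odds of high alcohol use per one-unit increase in famrel
round((1 - or_famrel) * 100, 1)  # ~24% lower odds
```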
# Plot visualisation with confidence intervals
library(finalfit)
dependent <- "high_use"
explanatory <- c("schoolsup", "higher", "famrel", "G3")
alc %>%
or_plot(dependent, explanatory,
breaks = c(0.5, 1, 2, 5),
table_text_size = 3.5,
title_text_size = 16)
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Waiting for profiling to be done...
## Warning: Removed 2 rows containing missing values (geom_errorbarh).
The results of the plot above match the observations.
# Binary prediction (1)
# fit the model
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
m1 <- glm(high_use ~ famrel + G3, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
prob2 <- predict(m1, type = "response")
library(dplyr)
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prob2 = prob2)
alc <- mutate(alc, prediction = probability > 0.5)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prob3 = prob2 > 0.5)
# see the last ten original classes, predicted probabilities, and class predictions
select(alc, schoolsup, higher, famrel, G3, high_use, probability, prediction) %>% tail(10)
## # A tibble: 10 × 7
## schoolsup higher famrel G3 high_use probability prediction
## <chr> <chr> <dbl> <dbl> <lgl> <dbl> <lgl>
## 1 no yes 4 13 FALSE 0.271 FALSE
## 2 no yes 4 0 FALSE 0.489 FALSE
## 3 no yes 5 2 TRUE 0.387 FALSE
## 4 no yes 5 12 FALSE 0.233 FALSE
## 5 no yes 4 8 FALSE 0.349 FALSE
## 6 no yes 5 5 FALSE 0.336 FALSE
## 7 no yes 4 12 FALSE 0.286 FALSE
## 8 no yes 1 4 FALSE 0.619 TRUE
## 9 no yes 2 13 TRUE 0.391 FALSE
## 10 no yes 4 10 TRUE 0.316 FALSE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 251 8
## TRUE 103 8
Observation: The model correctly predicted FALSE for 251 of the 259 students who are not high users, but TRUE for only 8 of the 111 high users. The sensitivity of the model is therefore 8/111 ≈ 0.07, which is poor, while the specificity is 251/259 ≈ 0.97, which is much better.

In the cross table, 8 times the prediction is heavy alcohol consumption when the true value is no heavy consumption (false positives). Similarly, 103 times the prediction is no heavy consumption when the true value is heavy consumption (false negatives).
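These rates follow directly from the 2×2 table above (true negatives 251, false positives 8, false negatives 103, true positives 8):

```r
# Counts read from the cross-tabulation above
TN <- 251; FP <- 8; FN <- 103; TP <- 8

sensitivity <- TP / (TP + FN)   # share of actual high users caught: 8/111
specificity <- TN / (TN + FP)   # share of actual non-high users caught: 251/259

round(c(sensitivity = sensitivity, specificity = specificity), 3)
```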
# rerunning some codes to avoid errors
m <- glm(high_use ~ schoolsup + higher + famrel + G3, data = alc, family = "binomial")
alc <- mutate(alc, probability = predict(m, type = "response"))
alc <- mutate(alc, prediction = probability > 0.5)
# Binary prediction (2)
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use))
# define the geom as points and draw the plot
g + geom_point() + aes(col = probability)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.67837838 0.02162162 0.70000000
## TRUE 0.27837838 0.02162162 0.30000000
## Sum 0.95675676 0.04324324 1.00000000
Observation: The table above shows that the model predicts about 96 percent of all students to be non-high users, while in reality only about 70 percent of the students are non-high users. The predictions thus differ considerably from the actual values of the variable.
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, alc$probability)
## [1] 0.3
Observation: The loss function gives a training prediction error of 0.3, i.e. 30% of the predictions are wrong.
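To see the loss function in action on values we can verify by hand, here it is applied to a toy vector of four cases, one of which is misclassified (hypothetical numbers, not from the model):

```r
# Same loss function as above: share of predictions on the wrong side of 0.5
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# Three predictions on the correct side of 0.5, one on the wrong side: 1/4 = 0.25
loss_func(class = c(1, 1, 0, 0), prob = c(0.9, 0.8, 0.2, 0.7))
```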
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.3081081
Observation: Since a lower value means better, this is a somewhat poorer result than the exercise result of 0.26.
To improve the model, we can change its explanatory variables and then recompute the cross-validation. Here I replace G3 with sex as an explanatory variable: not only does it relate strongly to the target variable, it will also be interesting to look at the question from the point of view of gender.
n <- glm(high_use ~ schoolsup + higher + famrel + sex, data = alc, family = "binomial")
# Summary of the model
summary(n)
##
## Call:
## glm(formula = high_use ~ schoolsup + higher + famrel + sex, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4855 -0.8633 -0.6683 1.2409 1.9794
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.7570 0.7207 1.050 0.293510
## schoolsupyes -0.0982 0.3675 -0.267 0.789324
## higheryes -0.8484 0.5327 -1.593 0.111222
## famrel -0.3235 0.1264 -2.559 0.010489 *
## sexM 0.9137 0.2427 3.765 0.000167 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 426.62 on 365 degrees of freedom
## AIC: 436.62
##
## Number of Fisher Scoring iterations: 4
# Calculation of the odds ratios and the confidence intervals
cbind(exp(coef(n)), exp(confint(n)))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.1319589 0.5214475 8.9591639
## schoolsupyes 0.9064662 0.4264470 1.8220117
## higheryes 0.4280849 0.1454536 1.2158699
## famrel 0.7236339 0.5634688 0.9265524
## sexM 2.4935013 1.5567858 4.0385316
# Prediction of the probability
probabilities_2 <- predict(n, type = "response")
# Probabilities to alc
alc <- mutate(alc, probability_2 = probabilities_2)
# Calculate a logical high use value based on probabilities.
alc <- mutate(alc, prediction_2 = probability_2 > 0.5)
# Recalculation of cross-validation and print the average number of wrong predictions.
cv_2 <- cv.glm(data = alc, cost = loss_func, glmfit = n, K = 10)
cv_2$delta[1]
## [1] 0.2972973
Interpretation: The new cross-validation error of about 0.297 is slightly better than the roughly 0.308 of the original model. (The exact values vary between runs, because the cross-validation folds are drawn at random.)
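A sketch of why the exact cross-validation error varies (illustrated with the built-in mtcars data rather than alc): the K folds are drawn at random, so each call to cv.glm gives a slightly different error. Fixing the seed, or averaging repeated runs, makes results comparable across models.

```r
library(boot)

# Same style of loss function as in the analysis above
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

# A small toy logistic model on the built-in mtcars data
m_toy <- glm(am ~ wt, data = mtcars, family = "binomial")

# Repeat the K-fold cross-validation five times and average the errors
set.seed(123)
errors <- replicate(5, cv.glm(data = mtcars, cost = loss_func,
                              glmfit = m_toy, K = 8)$delta[1])
errors        # slightly different value each repetition
mean(errors)  # a more stable estimate of the prediction error
```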
**____End of Chapter___**
date()
## [1] "Fri Nov 25 15:29:16 2022"
About the Data set
The Boston data set has 506 observations and 14 variables. It is one of the data sets that come pre-installed with the MASS package, and it contains information on housing values in the suburbs of Boston.
The variables involved are as follows:
crim - per capita crime rate by town.
zn - proportion of residential land zoned for lots over 25,000 sq.ft.
indus - proportion of non-retail business acres per town.
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox - nitrogen oxides concentration (parts per 10 million).
rm - average number of rooms per dwelling.
age - proportion of owner-occupied units built prior to 1940.
dis - weighted mean of distances to five Boston employment centres.
rad - index of accessibility to radial highways.
tax - full-value property-tax rate per $10,000.
ptratio - pupil-teacher ratio by town.
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
lstat - lower status of the population (percent).
medv - median value of owner-occupied homes in $1000s.
Source: Harrison, D. and Rubinfeld, D.L. (1978) Hedonic prices and the demand for clean air. J. Environ. Economics and Management 5, 81–102. Belsley, D.A., Kuh, E. and Welsch, R.E. (1980) Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. New York: Wiley.
#Loading required libraries for the Exercise
library(ggplot2)
library(dplyr)
library(corrplot)
## corrplot 0.92 loaded
library(GGally)
library(tidyr)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
#Loading Data Set
data("Boston")
#Exploring the Data Set
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Observation: The data set has 506 observations of 14 variables. The variables chas and rad are integers while the others are numeric. Judging from the distance between the Min., quartile and Max. values in the summary, most variable distributions are skewed; rm and medv are the closest to symmetric.
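One way to check the skewness mentioned above visually (a sketch, not part of the original analysis) is to reshape the data to long format and draw a histogram of every variable in a single facetted plot:

```r
library(MASS)     # for the Boston data
library(tidyr)    # for gather()
library(ggplot2)

# Gather all 14 columns into key/value pairs, one row per cell
boston_long <- gather(Boston)

# One histogram panel per variable, each with its own x-axis scale
ggplot(boston_long, aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ key, scales = "free")
```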
# plot matrix of the variables
pairs(Boston)
# printing the correlation matrix to look at the correlations between variables in the data
cor_matrix <- cor(Boston) %>% round(digits = 2)
# visualize the correlation matrix
library(corrplot)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Observations
Pair plot: The pair plot is the visual counterpart of the summary statistics discussed above.
Correlation plot: According to the correlation plot, many variable pairs have a high positive correlation; ‘rad’ and ‘tax’ show the strongest positive correlation of all. There are also strong negative correlations between many variables, the strongest being between ‘lstat’ and ‘medv’, ‘age’ and ‘dis’, ‘nox’ and ‘dis’, and ‘indus’ and ‘dis’. The distance variable ‘dis’ is therefore involved in most of the strongest negative correlations.
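A small sketch (not part of the original analysis) to back the eyeballed claims with numbers: list the strongest pairwise correlations directly, instead of reading them off the plot.

```r
library(MASS)  # for the Boston data

cm <- cor(Boston)
# Blank out the upper triangle and the diagonal so each pair appears once
cm[upper.tri(cm, diag = TRUE)] <- NA

# Flatten the matrix into a (Var1, Var2, correlation) table
pairs_df <- subset(as.data.frame(as.table(cm)), !is.na(Freq))

# The five strongest correlations by absolute value
head(pairs_df[order(-abs(pairs_df$Freq)), ], 5)
```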
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
Observation: After standardization every variable has mean zero and unit standard deviation. Note that scaling only shifts and rescales the values; it does not change the shape of the distributions, so skewed variables remain skewed.
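A quick sketch verifying what scale() actually does: subtracting the column mean and dividing by the column standard deviation reproduces the scaled values exactly.

```r
library(MASS)  # for the Boston data

boston_scaled <- as.data.frame(scale(Boston))

# Standardize one column by hand: (x - mean(x)) / sd(x)
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)

# The hand-computed values match the scale() output
all.equal(manual, boston_scaled$crim)
```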
2. Creating a categorical variable of the crime rate in the Boston data set
# Create a quantile vector of crim and create the categorical "crime".
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c('low','med_low','med_high','high'))
# Remove the original unscaled variable and add the new categorical value to scaled data.
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
# Look at the table of the new factor crime
table(boston_scaled$crime)
##
## low med_low med_high high
## 127 126 126 127
Observation: We have turned the previously continuous variable into a categorical one, renamed it from crim to crime, and added it to the boston_scaled data set. It has 127 observations in the low category, 126 in med_low, 126 in med_high and 127 in high.
When we want to use a statistical method to predict something, it is important to have data to test how well the predictions fit. Splitting the original data to test and train sets allows us to check how well our model works. The training of the model is done with the train set and prediction on new data is done with the test set. This way you have true classes / labels for the test data, and you can calculate how well the model performed in prediction.
Below, size = n * 0.8 means that 80% of the rows are chosen at random to form the training set.
# Number of rows in the Boston dataset
n <- nrow(boston_scaled)
# Choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# Create train set
train <- boston_scaled[ind,]
# Create test set
test <- boston_scaled[-ind,]
# Save the correct classes from the test data
correct_classes <- test$crime
# Remove the crime variable from the test data
test <- dplyr::select(test, -crime)
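One caveat worth noting (a sketch, not part of the original analysis): sample() draws a different random 80% split every time the document is knitted, so downstream results change between runs. Fixing the RNG seed (2022 here is an arbitrary choice) makes the split reproducible.

```r
n <- 506  # number of rows, as in the Boston data

# Draw the same 80% sample twice with the same seed
set.seed(2022)
ind1 <- sample(n, size = n * 0.8)
set.seed(2022)
ind2 <- sample(n, size = n * 0.8)

# The two index vectors are identical, so the split is reproducible
identical(ind1, ind2)
```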
LDA finds the linear combinations of the variables that best separate the classes of the target variable. For this analysis, the target variable is crime.
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2623762 0.2326733 0.2376238 0.2673267
##
## Group means:
## zn indus chas nox rm age
## low 0.97543647 -0.8977735 -0.08661679 -0.8852680 0.46287527 -0.8775245
## med_low -0.06462671 -0.3262614 0.02085925 -0.5749309 -0.09190006 -0.3426089
## med_high -0.39612627 0.2502745 0.17879700 0.4436498 0.05460387 0.3699489
## high -0.48724019 1.0169921 -0.09005591 1.0622579 -0.36757432 0.8271536
## dis rad tax ptratio black lstat
## low 0.8824996 -0.6958379 -0.7677688 -0.45095283 0.37857765 -0.77564271
## med_low 0.3438926 -0.5505852 -0.5134479 -0.04678098 0.31754146 -0.15753654
## med_high -0.3923216 -0.3980671 -0.2845421 -0.32396516 0.06635014 0.05756895
## high -0.8571271 1.6393984 1.5149640 0.78225547 -0.78397836 0.88251893
## medv
## low 0.5420313
## med_low 0.0346041
## med_high 0.1663234
## high -0.7215869
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08447768 0.684782154 -0.84449575
## indus 0.01780437 -0.235238688 0.19137881
## chas -0.08093902 -0.046756104 0.13910844
## nox 0.34259248 -0.758739887 -1.33895099
## rm -0.06157118 -0.043640551 -0.16232800
## age 0.31702220 -0.209611717 -0.09007108
## dis -0.05161374 -0.231216406 0.03770089
## rad 3.10420071 1.043655642 -0.14326266
## tax 0.10288507 -0.137064991 0.54008441
## ptratio 0.11563945 0.007061256 -0.10758522
## black -0.11256833 0.022090876 0.12254221
## lstat 0.25126096 -0.270506395 0.30521965
## medv 0.18675535 -0.452447724 -0.22984616
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9547 0.0353 0.0100
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(factor(train$crime))
# plot the lda results
plot(lda.fit, col = classes, pch = classes, dimen = 2)
lda.arrows(lda.fit, myscale = 1)
Observation: The proportion of trace tells us that LD1 explains about 95.5% of the between-group variance, LD2 about 3.5% and LD3 about 1.0%. The plot shows that the crime classes are fairly clearly separated, and the long arrow for ‘rad’ (accessibility to radial highways) indicates it is the most influential separating variable.
(Please note: the saving of the crime categories and the removal of the categorical crime variable from the test data can be found just before the start of the LDA exercise, in the train/test split above.)
# Predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# Cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 8 1 0
## med_low 4 21 7 0
## med_high 0 13 16 1
## high 0 0 0 19
Observation: The cross-tabulation suggests that the model predicts the ‘high’ crime class very well. Of the 102 test observations, the majority fall on the diagonal, i.e. are predicted into the correct category, which makes the model usable for rough predictions.
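The overall accuracy can be computed from the cross-tabulation by summing the diagonal (correct predictions) and dividing by the total. A sketch using the counts printed above (note that these exact counts vary between runs, since the train/test split is random):

```r
# Confusion matrix copied from the table printed above
tab <- matrix(c(12,  8,  1,  0,
                 4, 21,  7,  0,
                 0, 13, 16,  1,
                 0,  0,  0, 19),
              nrow = 4, byrow = TRUE,
              dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                              predicted = c("low", "med_low", "med_high", "high")))

# Correct predictions lie on the diagonal: 12 + 21 + 16 + 19 = 68 of 102
accuracy <- sum(diag(tab)) / sum(tab)
accuracy  # 68/102, about 0.67
```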
# Reload Boston dataset.
data("Boston")
# center and standardize variables
boston_scaled <- scale(Boston)
# change the object to data frame from matrix type.
boston_scaled <- as.data.frame(boston_scaled)
# Calculate the Euclidean distances between observations.
dist_eu <- dist(boston_scaled, method = "euclidean")
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
Observation: The Euclidean distance is the most common distance measure. Here the distances range from 0.1343 to 14.3970, with a mean of 4.9111 and a median of 4.8241.
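For comparison (a sketch, not part of the original analysis), the same summary can be computed with the Manhattan (city-block) metric, which dist() also supports:

```r
library(MASS)  # for the Boston data

boston_scaled <- as.data.frame(scale(Boston))

# Manhattan distances between all pairs of observations
dist_man <- dist(boston_scaled, method = "manhattan")
summary(dist_man)
```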
# K-means clustering
km <- kmeans(boston_scaled, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled[6:10], col = km$cluster)
Observation: Columns 6–10 of the scaled data are plotted here, with the points colored by their k-means cluster (k = 3).
We will now find the optimal number for clusters
# Investigate optimal number of clusters and rerun algorithm
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(Boston, k)$tot.withinss})
# Visualize the cluster.
qplot(x = 1:k_max, y = twcss, geom = 'line')
Observation: The total within-cluster sum of squares (WCSS) drops sharply when moving from one to two clusters and flattens out after that, which suggests two clusters.
Let us now check how the clustering looks with two clusters.
# k-means clustering
km <- kmeans(boston_scaled, centers = 2)
# Plot the clusters
pairs(boston_scaled[6:10], col = km$cluster)
Observation: Compared with the previous three-cluster run there is not a dramatic difference, but going by the WCSS curve, two seems to be the optimal number of clusters for this data when the Euclidean distance is used.
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Plot.
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
# k-means with 4 clusters to compare the methods.
km3D <-kmeans(boston_scaled, centers = 4)
#Plot with km cluster as color of data
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3D$cluster[ind])
Observation
Through the above super cool 3D plots, my observation is that the k-means clustering separates the groups more cleanly. In the crime-colored plot it was harder to tell which observations fall where, as some of the classes overlap each other.
**____End of chapter___**
(more chapters to be added similarly as we proceed with the course!)